Fix: gpu detection when multiple GPUs exist#275
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR refactors GPU detection and handling to support systems with multiple GPUs by replacing a boolean GPU flag with a more detailed GPU specifications structure. Key changes include:
- Replacing the boolean has_gpu flag with an Option throughout the Docker service and CLI layers.
- Updating the DockerManager to use the GPU specifications for device requests.
- Modifying GPU detection logic to collect and select multiple GPU devices.
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| worker/src/docker/service.rs | Switched from a boolean flag to an Option for GPU usage. |
| worker/src/docker/docker_manager.rs | Updated the start_container parameter and host configuration to use GPU specs. |
| worker/src/cli/command.rs | Adjusted CLI command construction to reflect the new GPU model. |
| worker/src/checks/hardware/hardware_check.rs | Changed GPU collection to select the main GPU from multiple GPUs. |
| worker/src/checks/hardware/gpu.rs | Redesigned GPU detection to aggregate multiple GPU devices. |
| shared/src/models/node.rs | Added an optional indices field to the GpuSpecs model. |
| discovery/src/api/routes/node.rs | Updated test fixtures to include GPU indices. |
Comments suppressed due to low confidence (2)
worker/src/docker/docker_manager.rs:193
- [nitpick] Consider renaming the variable 'gpu' to 'gpu_specs' for clarity, as it represents detailed GPU specifications rather than a simple boolean flag.
let host_config = if gpu.is_some() {
worker/src/docker/docker_manager.rs:200
- [nitpick] Instead of using unwrap, consider using pattern matching (e.g., if let Some(gpu_specs) = gpu) to safely access 'indices'. This avoids any potential panic and makes the control flow clearer if 'indices' is unexpectedly None.
device_ids: gpu.unwrap().indices.map(|indices| indices.iter().map(|i| i.to_string()).collect()),
| let name = device.name().unwrap_or_else(|_| "Unknown".to_string()); | ||
| let memory = device.memory_info().map(|m| m.total).unwrap_or(0); | ||
| let driver_version = nvml | ||
| .sys_driver_version() | ||
| .unwrap_or_else(|_| "Unknown".to_string()); | ||
|
|
||
| GpuDevice::Available { | ||
| name, | ||
| memory, | ||
| driver_version, | ||
| device_count, | ||
| if let Some(existing_device) = device_map.get_mut(&name) { | ||
| existing_device.count += 1; | ||
| existing_device.indices.push(i as u32); | ||
| } else { | ||
| device_map.insert( | ||
| name.clone(), | ||
| GpuDevice { | ||
| name, | ||
| memory, | ||
| driver_version, | ||
| count: 1, | ||
| indices: vec![i as u32], | ||
| }, | ||
| ); | ||
| } |
There was a problem hiding this comment.
I looked up a few SMI examples and it seems like the "name" field is usually unique to a product, but also this doesn't seem to be guaranteed, e.g. with VRAM modded cards (that might get more popular in the future) it could cause errors not to check the VRAM available. It might be worthwhile appending the memory as a string to the name and using that as a key.
*use GPU type with highest count * mount specific GPU idx and not all
*use GPU type with highest count * mount specific GPU idx and not all
No description provided.